add kitsune.l10n app for handling content localization #6330

@escattone commented Nov 1, 2024

mozilla/sumo#2053

Notes

This PR introduces a new kitsune.l10n application into SUMO that, for now, automatically creates and manages machine translations of KB articles. It uses an LLM to create each machine translation and, if a prior approved translation exists, the result is heavily influenced by that prior translation. This respect for prior contributor translations was the requirement that drove this approach.

This new kitsune.l10n app is designed to be independent of the apps containing the content that it localizes -- so far, just the kitsune.wiki app. In other words, the kitsune.wiki app knows nothing about -- and imports nothing from -- the kitsune.l10n app. If the kitsune.l10n app were removed, the kitsune.wiki app would continue functioning as usual, just without automatically generated machine translations.

In general, the system is relatively simple and consists of two main components:

  • A real-time component that initiates a Celery task that creates and manages machine translations as needed for the specific KB article that's been updated with a new approved revision.
  • A periodic or "heartbeat" component -- whose interval is configurable -- that continuously initiates the same Celery task (in this case, with no argument) that creates and manages machine translations for all KB articles as needed. This "heartbeat" component typically just manages existing translations -- for example, rejecting machine translations that are outdated or approving machine translations that haven't been reviewed within the grace period -- but also acts as a backup to the real-time component should it fail for any reason.

Both of the main components above call the same Celery task, handle_wiki_localization(), which, in turn, uses the following two core functions:

  • create_machine_translations()
  • manage_existing_machine_translations()
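A minimal sketch of how the task might dispatch to these two functions -- only the function names come from this PR; the bodies here are illustrative stand-ins, not the actual implementation:

```python
# Hypothetical sketch of handle_wiki_localization()'s dispatch logic.
# Function names mirror the PR description; bodies are stubs.

def create_machine_translations(article_id=None):
    # Would create LLM-based translations for one article (real-time path)
    # or for all articles needing them (heartbeat path, article_id=None).
    scope = article_id if article_id is not None else "all"
    return f"create:{scope}"

def manage_existing_machine_translations():
    # Would reject outdated machine translations and approve those not
    # reviewed within the grace period.
    return "manage:existing"

def handle_wiki_localization(article_id=None):
    # Both the real-time and heartbeat components invoke this same task.
    return [manage_existing_machine_translations(),
            create_machine_translations(article_id)]
```

The heartbeat path simply calls the task with no argument, so a single code path serves both components.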

Once the handle_wiki_localization() Celery task has started (in any Celery worker), it cannot be run again (in any Celery worker) until it has finished. This is managed via a Postgres advisory lock, which must be acquired in order to start, and which is released only upon normal completion or an exception. This prevents the (admittedly small) possibility of creating duplicate machine translations, which could occur if two instances of the task ran simultaneously.
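The advisory-lock guard could be sketched as follows. pg_try_advisory_lock and pg_advisory_unlock are real Postgres functions, but the FakeCursor, the lock key, and the context manager here are illustrative stand-ins, not the PR's actual code:

```python
from contextlib import contextmanager

L10N_LOCK_KEY = 123456  # assumed application-chosen 64-bit lock key

@contextmanager
def advisory_lock(cursor, key):
    # pg_try_advisory_lock returns true only if no other session holds the key.
    cursor.execute("SELECT pg_try_advisory_lock(%s)", [key])
    acquired = cursor.fetchone()[0]
    try:
        yield acquired
    finally:
        # Released on normal completion or exception, per the PR description.
        if acquired:
            cursor.execute("SELECT pg_advisory_unlock(%s)", [key])

# Minimal stand-in for a DB cursor so the pattern can be exercised locally.
class FakeCursor:
    def __init__(self):
        self.locked = set()
        self.last = None

    def execute(self, sql, params):
        key = params[0]
        if "try_advisory_lock" in sql:
            self.last = key not in self.locked
            if self.last:
                self.locked.add(key)
        else:
            self.locked.discard(key)
            self.last = True

    def fetchone(self):
        return (self.last,)
```

A second caller sees the try-lock return false and skips the run, which is what prevents two simultaneous instances of the task.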

All of the settings for machine translations can be configured via the Django admin, and any changes take immediate effect. Currently, machine translations can be restricted by locale, by KB article slug, by the group of the KB article approver, and/or by the approval date/time.

By default, this l10n application is disabled.

Local Testing

@akatsoulas @smithellis @emilghittasv -- All of you are already configured to impersonate the GKE dev service account, which provides access to the Vertex AI API.

  • Impersonate the GKE dev service account locally
    • gcloud auth application-default login --impersonate-service-account <gke-dev-sa-email> -- I can send you the email to use via Slack
    • Set your location to the root of the kitsune repo -- cd ~/repos/kitsune
    • Move the impersonated creds into the root -- cp -pr ~/.config/gcloud ./gcloud
  • Add the following to your .env file:
    • GOOGLE_APPLICATION_CREDENTIALS=./gcloud/application_default_credentials.json
    • GOOGLE_CLOUD_PROJECT=moz-fx-sumo-nonprod
  • Bring up your docker environment, run the DB migrations, etc.
  • Go into the admin and configure machine translations. You'll need to do the following:
    • Enable machine translation
    • Specify an LLM model name -- use gemini-1.5-pro-002
    • Select one or more locales
    • Adjust, keep the default, or clear the "approved after" date/time limitation
    • Add any other limitations as desired
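Put together, and assuming the service-account email shared via Slack is in $GKE_DEV_SA_EMAIL and your checkout lives at ~/repos/kitsune, the setup steps above look roughly like this (the docker compose invocations are assumptions about your local workflow):

```shell
# Impersonate the GKE dev service account and stage the credentials.
gcloud auth application-default login --impersonate-service-account "$GKE_DEV_SA_EMAIL"
cd ~/repos/kitsune
cp -pr ~/.config/gcloud ./gcloud

# Point the app at the impersonated credentials and the nonprod project.
cat >> .env <<'EOF'
GOOGLE_APPLICATION_CREDENTIALS=./gcloud/application_default_credentials.json
GOOGLE_CLOUD_PROJECT=moz-fx-sumo-nonprod
EOF

# Bring up the environment and apply migrations (assumed invocations).
docker compose up -d
docker compose exec web ./manage.py migrate
```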
[Screenshot: machine translation settings in the Django admin]

Future Adjustments

  • Reporting -- Create new L10n reporting views/pages
  • Error handling -- Currently, if creating any single machine translation (create_machine_translation()) raises an exception, any pending machine translations are abandoned for the current run of the handle_wiki_localization() Celery task. That task will run again at the next heartbeat, so we'll try again later, but we may want to handle some exceptions explicitly in the future. The challenge is that it's difficult to tell from the source code which exceptions might be raised, so I think we can wait to see what Sentry events we get, if any, and then decide whether it makes sense to handle them. Note that the invoke() method of LangChain's chat model already retries on common, recoverable API exceptions (it's currently configured to retry twice before giving up), so we may not need any LLM API exception handling at all -- we can simply let exceptions be raised and reported by Sentry (the current approach).
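The "retry twice before giving up" behavior described above can be illustrated with a pure-Python sketch -- this is not LangChain's actual implementation, and RecoverableAPIError is a hypothetical placeholder for whatever API exceptions count as recoverable:

```python
# Illustration of a "retry N times on recoverable errors, then re-raise"
# policy, similar in spirit to what LangChain's invoke() does internally.

class RecoverableAPIError(Exception):
    """Stand-in for a transient LLM API failure (rate limit, timeout, etc.)."""

def invoke_with_retry(call, max_retries=2):
    attempts = 0
    while True:
        try:
            return call()
        except RecoverableAPIError:
            if attempts >= max_retries:
                raise  # surfaces to Sentry, per the current approach
            attempts += 1
```

Unrecoverable exceptions pass straight through on the first attempt, so Sentry still sees them immediately.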

Infrastructure Configuration Needed

TODO

  • Add ability to limit machine translation to revisions approved by users within a specific group.
  • Add tests to test_wiki.py that cover the slug, date, and group filtering.
  • Add whatever we need to cover our reporting needs.
  • Add the ability to exclude revisions created by the SUMO L10n Bot on the recent revisions page.
  • Record LLM service calls and make them available in the Django admin.
  • Record wiki revision activity and make it available in the Django admin.
